# Information Theory

## Self-information

It converts probability to bits by

$$I(X) = -\log_2{p}$$

for an event $X$ that happens with probability $p$. Note:

- $I(X) \geqslant 0$ since $p \leqslant 1$.
- An event with a smaller probability carries more information (bits) when it happens.
- We can interpret the self-information ($-\log(p)$) as the amount of *surprise* we have at seeing a particular outcome.

## Entropy

For any random variable $X$ that follows a probability distribution $P$ with a probability density function (p.d.f.) or a probability mass function (p.m.f.) $p(x)$, we measure the expected amount of information through the ***entropy*** (or *Shannon entropy*)

$$H(X) = - E_{x \sim P_X} [\log p(x)].$$

### Joint Entropy

Similar to the entropy of a single random variable, we define the *joint entropy* $H(X, Y)$ of a pair of random variables $(X, Y)$ as

$$H(X, Y) = - E_{(x, y) \sim P_{X, Y}} [\log p_{X, Y}(x, y)].$$
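
As a minimal sketch of these three quantities, here is a short NumPy example (not from the original text) that computes self-information, entropy, and joint entropy in bits for discrete distributions given as arrays; the function names `self_information`, `entropy`, and `joint_entropy` are my own, and base-2 logarithms are assumed throughout.

```python
import numpy as np

def self_information(p):
    """Self-information I(X) = -log2(p) of an outcome with probability p, in bits."""
    return -np.log2(p)

def entropy(p):
    """Shannon entropy H(X) = -sum_x p(x) log2 p(x) of a discrete p.m.f. given as an array."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log 0 = 0, so drop zero-probability outcomes
    return float(-np.sum(p * np.log2(p)))

def joint_entropy(p_xy):
    """Joint entropy H(X, Y) of a discrete joint p.m.f. given as a 2-D array."""
    return entropy(np.asarray(p_xy).ravel())

# A fair coin flip: each outcome carries 1 bit, and the entropy is 1 bit.
print(self_information(0.5))          # 1.0
print(entropy([0.5, 0.5]))            # 1.0

# Joint p.m.f. of two independent fair coins: H(X, Y) = 2 bits.
print(joint_entropy([[0.25, 0.25],
                     [0.25, 0.25]]))  # 2.0
```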